



THE RELIABILITY OF SINGLE MEASUREMENTS WITH STANDARD TESTS



S. A. COURTIS 
Liggett School, Detroit, Mich. 



A recent article in this Journal gave the results and conclusions 
of a study of the reliability of single measurements in the derivation 
of standard scores in adding. The tests used (Test No. 1 of the
Courtis series, see Fig. 1, below) measured the ability to write the
answers to the fundamental combinations in addition. 

For the purpose and the results of the study the writer has only 
the highest praise; the article as a whole is a valuable, a suggestive, 
and a much-needed contribution to the work of standardization. 
With certain features of the method of the study, however, and 
with the conclusions drawn from the data, the writer wishes to take 
issue. In the present article he will attempt to make clear his 
reasons for believing the method faulty, and, with the aid of addi- 
tional data from certain investigations of his own, will try to so 
interpret the results obtained by the authors as to reverse their 
conclusions. 

A brief summary of the method, results, and conclusions of the
article is presented herewith as a basis for the present discussion.
For a really adequate explanation of essential details reference 
must, of course, be made to the original paper. 

Two hundred and seventy eighth-grade children in the eight 
larger grammar schools of San Jose, Cal., were given a practice 
series of twenty-five tests, five on each of five days. The tests 
used each day consisted of one like that shown in Fig. 1 and four
slightly altered arrangements of it. These were given under care- 
fully controlled conditions to all. The paper of any individual who 
made more than two errors in any one minute was rejected (68 
cases, or 25 per cent of total number), leaving a selected group of 
202 papers, chosen as representing a reasonable degree of accuracy 
as the basis for the discussion of the article. 


[Fig. 1: printed test sheet, not reproduced. Heading: "Arithmetic,
Test No. 1. Speed Test, Addition." Motto: "Measure the efficiency of
the entire school, not the individual ability of the few."
Instructions on the sheet: "Write on this paper, in the space between
the lines, the answers to as many of these simple addition examples
as possible in the time allowed." The sheet carries rows of
single-digit addition combinations and blanks for name, school, and
grade.]

Fig. 1

In the first test the scores varied from 28 to 88 combinations,
a range of 60 combinations, the middle half falling between 42 and
60. The median of the first scores was 51 combinations. The
medians of the twenty-five scores in the 202 cases varied from 42
to 107 combinations, a range of 65 combinations, the middle half
falling between 62 and 80. The median of the group was 70
combinations. From this data it is evident that the first score of the
group as a whole is fifty-one seventieths of the median score. On
the basis of this relation it became possible to compute for each
individual a hypothetical first score from his median score and this
was done. That is, on the assumption that the middle measure of
all the first scores of 202 children represents the same relative
position on the scale of status as the middle measure of all medians,
the hypothetical first score for each individual represents the real
measure of his initial ability. For it is a measure derived from
twenty-five scores with the practice effect eliminated. The
difference between the actual and hypothetical first scores was then
found for each individual with the following results:



328 THE ELEMENTARY SCHOOL TEACHER 

Out of 182 cases where all twenty-five tests were taken the first 
scores deviated from the hypothetical values by between zero and 
one combination in twenty-six cases, by between one and two com- 
binations in eighteen cases, in order as follows: 

Deviation (combinations)      0-1    1-2    2-3    3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12  12-13
Cases                          26     18     25     23     17     11      9      8     12      8      7      4      4

Deviation (combinations)    13-14  14-15  15-16  16-17  17-18  18-19  19-20  20-21  21-22  22-23  23-24  24-25  25-26
Cases                           3      0      3      1      1      0      0      0      0      0      1      0      1
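
The scaling step and the tally just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original authors' procedure: the names `hypothetical_first_score`, `CASES`, and `share_within` are introduced here for convenience, while the ratio 51/70 and the case counts are taken from the text.

```python
from fractions import Fraction

# Group medians quoted in the text: first scores 51, all twenty-five scores 70.
GROUP_FIRST_MEDIAN = 51
GROUP_OVERALL_MEDIAN = 70

def hypothetical_first_score(individual_median):
    """Scale an individual's median score by the group ratio 51/70."""
    return individual_median * Fraction(GROUP_FIRST_MEDIAN, GROUP_OVERALL_MEDIAN)

# Cases per one-combination interval of deviation (0-1, 1-2, ..., 25-26),
# in the order tabulated in the text.
CASES = [26, 18, 25, 23, 17, 11, 9, 8, 12, 8, 7, 4, 4,
         3, 0, 3, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1]

def share_within(limit):
    """Fraction of cases whose deviation is below `limit` combinations."""
    return sum(CASES[:limit]) / sum(CASES)

total_cases = sum(CASES)        # 182, the number of complete records
within_five = share_within(5)   # 109 of the 182 cases deviate by less than 5
```

For a child whose median over the twenty-five tests is 84 combinations, `float(hypothetical_first_score(84))` gives 61.2; the difference between that figure and the child's actual first score is the deviation tallied above.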

The individual scores showed a large practice effect, the average 
difference between the first and last scores being 28 combinations. 
There were marked individual differences in the total amount
of the practice effect, in the variation in successive scores, and in
the form of the practice curves.

The conclusions drawn from this study were: 

1. That one performance is altogether too uncertain as a test of 
an individual for purposes of grading or of diagnosis. 

2. That twenty-five trials would be necessary to safely measure 
the ability of an eighth-grade child to write the addition combina- 
tions. 

3. That from a certain point of view the usefulness of the test 
may be questioned owing to the uncertainty as to what it measures. 

4. That the practice effect was probably to be explained not in 
terms of increased readiness of mental association but from an 
increased facilitation of neuro-muscular sort in the manipulation of 
the writing instrument. 

Taking up the discussion of these conclusions in the reverse 
order, the writer can express at once his basic criticism of the 
whole study by saying that the addition test shown was put to a 
use for which it was not intended. It is not surprising, therefore, 
that undesirable results were obtained and the suspicions of the 
authors aroused as to the usefulness of the test itself. The writer 
will accordingly attempt to make clear the real purpose of the test, 
and will try to prove that, while rightly used the test does measure a 
valuable ability, the ability generated by its repeated use is specific 
and may be largely valueless. 

The evolution of the series of tests, of which the one under dis- 
cussion is a member, has been described in previous articles in this 




Journal and the description need not be repeated here. 1 It is 
sufficient to say that they are the end-products of a long series of 
experiments, in the course of which most of the questions and 
criticisms that are likely to occur to a person considering the use 
of standard tests for the first time, were raised and answered. 

The purpose of the series as a whole was to enable the writer to 
study and bring under control the fundamental abilities in abstract 
work with whole numbers — the abilities represented by Test 7 of 
the series, of which column addition is a part. In the analysis 
of the mistakes of Test 7, however, the necessity for diagnostic tests 
of the simpler component abilities was soon perceived and tests 
Nos. 1 to 5 were constructed. 

For as the authors of the study show clearly, since the ability 
to add a column of figures involves the control of four or five 
elemental abilities, tests that disclose weakness in one or more of 
the components enable a teacher to concentrate his efforts on the 
exact cause of failure, and so to increase the efficiency of his teach- 
ing. What it was proposed to do then was to cross-section the 
minds of the children and to try to control the complex ability 
through control of the component elements. 

All measurements of mental facts, however, appear to differ
from measurements of physical facts in this particular, that the
conditions under which the measurements are made can never be
reproduced again exactly. In the physical world the length of a
brass rod is unchanged by the mere measurement of its length,
although in the last analysis this is pure assumption, as the time
element changes from one measurement to another and the time
conditions cannot be reproduced at will. At least it is possible to
say that the length is apparently unaffected by measurement. In
dealing with the mental facts, however, we do not even assume
that we can reproduce the conditions, as the mind itself is so very
evidently affected by the measurement. To write the answers to
the addition combinations in Test 1 is to increase by one the
number of times we have responded to these visual stimuli, and the law
of habit formation leads us to suppose that the character of our
response to that stimulation in the future will in some degree be
affected by the increase.

1 The subject-matter of each test is as follows:
  Test No. 1. Addition (combinations, 0-9)
           2. Subtraction (combinations, 0-9)
           3. Multiplication (combinations, 0-9)
           4. Division (combinations, 0-9)
           5. Copying figures (rate of motor activity)
           6. Speed Reasoning (judgment of operation to be used in simple one-step problems)
           7. Fundamentals (abstract examples in the four operations)
           8. Reasoning (two-step problems)

A careful examination of the situation, however, shows that 
measurement of mental facts is more complex than measurement of 
physical facts for another reason: measurement must be made
indirectly. We compare length with length, but we cannot so 
measure readiness of response. Instead we resort to indirect 
measurement. In column addition readiness of response to a 
stimulation partly visual, partly mental is a factor determining 
ability. 

But in the attempt to measure that factor by Test 1 we supply a
stimulus that is wholly visual and we really measure the resulting
motor achievement. At least four types of factors play a part in
determining that achievement: the purely physical, as contrast
between printing and paper, lighting conditions, hardness of pencil, 
etc.; the purely physiological, as state of health, fatigue, bodily 
structure, etc.; the purely psychological, as emotional state, degree 
of conscious control, etc.; and the factors directly involved in the 
control of the sensori-motor machinery. In repeated testing there 
is great danger, therefore, that any change observed may be due 
to the influence of factors other than those it is desired to measure, 
and the term "reliability" as applied to mental measurements 
needs definition. 

In physical measurements, where the quality of thing measured 
is apparently not affected by the measurement, and all causes of 
variation in the result are external, the median of twenty-five 
measurements is more reliable than a single measurement, as the 
chance errors produced by the external causes are now positive, 
now negative, and tend more and more to destroy each other as the 
number of measurements is increased. In mental measurements, 
however, the fact that the thing to be measured may be changed 
by the measurement and the fact that, even if it is not changed, the 
actual achievements by which it is measured may vary because of 




changes in other factors, alter the situation completely. A reliable
measure of a quality is one that accurately reflects the degree of 
that quality in the person at the time the measurement is made. 
If a practice series produces changes in the factors determining 
achievement, then twenty-five measurements or any derived value 
based upon them may be more unreliable than a single measure- 
ment. The real question of reliability as applied to a measurement 
of readiness of association in the case of the addition combinations 
resolves itself into two questions: Is readiness of association a 
sufficiently definite and stable quality to permit of its being meas- 
ured at all ? If so, is it the determining factor in the achievement 
by which it is measured ? 
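
The contrast drawn here between chance error and change produced by the measurement itself can be illustrated with a small simulation. The sketch below is hypothetical throughout: the initial ability of 51 combinations, the chance error of 5 combinations, and the practice gain of 1.2 combinations per trial are invented for illustration and are not data from either study; `simulate_scores` and `compare` are names introduced here.

```python
import random
import statistics

def simulate_scores(true_first, gain_per_trial, noise_sd, trials, rng):
    """Scores = initial ability + accumulated practice gain + chance error."""
    return [true_first + gain_per_trial * t + rng.gauss(0, noise_sd)
            for t in range(trials)]

def compare(gain_per_trial, pupils=2000, seed=1):
    """Mean absolute error, as a measure of initial ability, of a single
    first score and of the median of twenty-five scores."""
    rng = random.Random(seed)
    true_first = 51.0  # hypothetical initial ability, in combinations
    first_err, median_err = [], []
    for _ in range(pupils):
        scores = simulate_scores(true_first, gain_per_trial, 5.0, 25, rng)
        first_err.append(abs(scores[0] - true_first))
        median_err.append(abs(statistics.median(scores) - true_first))
    return statistics.mean(first_err), statistics.mean(median_err)

# No practice effect: the thing measured is unchanged by measurement.
stable_first, stable_median = compare(gain_per_trial=0.0)

# With a practice effect: each trial changes the thing measured.
practice_first, practice_median = compare(gain_per_trial=1.2)
```

Under these assumptions the median of twenty-five scores is the closer estimate when nothing changes between trials, and the single first score is the closer estimate of initial ability once practice shifts every later score upward.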

It must be evident that as readiness of association cannot be 
isolated and measured directly, no final answer to either of these 
questions can be made. Yet where the same inference can be 
drawn from several different types of data, its truth becomes 
reasonably certain. 

The writer's experience that many children and adults show great
constancy of performance in various situations led him to assume
at the outset of the testing work that knowledge of the tables was
a definite and stable quantity capable of exact measurement, and
only recently has he had occasion to question that opinion.
The authors of the study under discussion evidently were of the 
same opinion and their reluctance to credit the practice effect to 
increased readiness of association would seem to show that they 
did not entirely abandon it. A priori, it would seem that an ability 
slowly built up through six or seven years of constant repetition
(thousands of repetitions for each combination) and permanent
enough to endure through life in spite of periods of disuse years 
long, must be both definite enough to permit of exact measurement 
and stable enough to be practically unaffected by the measurement. 

However, a-priori reasoning is unsatisfactory when more direct 
evidence can be obtained and in this case such evidence is not 
wanting. 

In the results from measurements of groups, the individual
variations caused by minor factors tend to destroy each other, 
leaving in high relief only those effects common to all members of 






the group. It is possible, therefore, to correlate ability to work 
abstract examples of Test 7 (the four operations) with the total of 
the scores in the first four speed tests (the addition, subtraction, 
multiplication, and division combinations) on the basis of returns 
from many large groups of children. 

TABLE I

Relation between ability to work in the four operations, simple examples with whole
numbers (Test 7), and knowledge of the tables (total of scores in Tests 1 to 4).
Average scores of various grades of children from third through twelfth, in
Boston, New York, Detroit, and of the tabulations to determine standard scores,
55,200 children in all. Scores for Test 7 are the number of examples attempted
and the number right in twelve minutes. For totals, the sum of the number of
answers per minute in each of the four speed tests (the addition, subtraction,
multiplication, and division combinations).

          Average Scores of Group in                Average Scores of Group in
  No. of  Attempts   Rights   Totals        No. of  Attempts   Rights   Totals
Children    Test 7   Test 7   1 to 4      Children    Test 7   Test 7   1 to 4
in Group                                  in Group

     472       5.4      1.7       87           244      12.0      8.3      178
     525       5.4      1.7       72           410      12.0      8.8      185
   2,278       5.6      1.4       86         5,670      12.5      7.0      176
     345       6.2      2.4       95         2,129      12.5      7.4      182
     481       6.4      2.7      102           264      12.5      7.8      179
   1,222       6.6      3.6      102           405      12.8      9.0      177
   2,055       6.9      2.8      109           209      12.8      8.9      185
     350       7.5      3.9      117         1,370      13.1      8.9      189
   5,520       7.8      4.0      137           328      13.5      9.2      197
     276       7.8      4.8      137           412      13.7      9.5      198
     530       8.0      5.9      136         4,771      14.0      8.5      194
   5,390       8.8      4.2      128           216      14.0      9.5      191
   1,177       9.0      5.3      130           151      14.4      9.4      198
   2,710       9.2      4.7      135           368      14.5      9.8      209
     476       9.5      6.1      153           169      14.9     10.8      202
     484       9.9      7.6      165           179      15.4     10.5      232
     335      10.0      6.3      149         4,502      15.7     10.1      219
   1,282      10.3      6.9      152           440      15.7     10.9      224
     260      10.8      7.3      151           120      16.0     11.0      230
     425      10.8      6.5      161           257      16.1     11.5      227
   2,518      10.9      6.5      163           464      16.8     12.6      233
   5,836      10.9      5.8      157           131      17.2     11.8      242
   1,432      11.5      7.6      167

Total number of children: 55,200









The writer has recently had opportunity to carry through, with 
a trained force of examiners and with mechanical timing, tests of 
school children in Detroit (2), New York, and Boston. The 
Boston test was made at the beginning of the year (October), the 
first Detroit test in January, the second in June, the New York 
test in April. He has also the returns from the investigation to 




determine standard scores, representing between three hundred and 
four hundred classes in sixty to seventy schools in ten states. The 
average scores made by each group of children examined (eighth 
grade, seventh grade, sixth grade, etc.) in each of the two traits 
are given in Table I, arranged without respect to either city or 
grade, but solely on a basis of size of scores in Test 7. The number 
of children in each group is also given. All grades from third to 
twelfth are represented and a few scores are from normal-school 
students. Returns from rather more than 55,000 children in 45 
divisions are represented in the table. 

It is evident both from the table and from the graphic
representation of the data (Fig. 2) that the correlation between speed work
(ability to attempt examples) and knowledge of the tables is
positive and high (Pearson coefficient of correlation = +0.98). That
is, in general the ability to complete any number of examples in
Test 7 in a given time (say eleven examples in the twelve minutes
allowed for this test) occurs with a corresponding score in
knowledge of the tables (in this case 160 combinations in four minutes)
without respect to either grade, city, or time of year. For accuracy
(number of examples right) the correlation is also positive but
slightly lower. Given the score made in one of these tests by any
large group of children, the score made in the other tests can be
computed by the formulas $C = 13.3A + 13.3$ and $C = 15.4R + 55$, in
which $C$ is the total score in the combinations, $A$ the score in
number of examples attempted, and $R$ the number of examples right.
Weighting the various results in proportion to the number of
children in the groups, the average deviations of the actual from the
computed values are $C_A = 3.2$, $C_R = 9.1$. That is, in general,
knowledge of the tables, or readiness of response, determines speed
of work and to a lesser degree accuracy also.
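
The two formulas can be checked against the worked figures quoted in this article: eleven attempts should correspond to about 160 combinations, and twelve rights to about 240. The sketch below is a minimal illustration; the function names are assumptions introduced here, and `weighted_average_deviation` shows one plausible reading of "weighting in proportion to the number of children," not a procedure the article spells out.

```python
def total_from_attempts(attempts):
    """Computed total in the combinations (Tests 1-4) from Test 7 attempts."""
    return 13.3 * attempts + 13.3

def total_from_rights(rights):
    """Computed total in the combinations (Tests 1-4) from Test 7 rights."""
    return 15.4 * rights + 55

def weighted_average_deviation(groups, predict):
    """Average |actual - computed| total, weighted by children per group.

    `groups` is a list of (children, test7_score, actual_total) tuples.
    """
    weight = sum(n for n, _, _ in groups)
    return sum(n * abs(total - predict(score))
               for n, score, total in groups) / weight

print(round(total_from_attempts(11)))  # 160
print(round(total_from_rights(12)))    # 240
```

Both results agree with the values given in the text, which supports the reading of the formulas as linear in $A$ and $R$.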

In considering these results the reader should be careful not to
infer too much. It is quite certain that each of the two traits is a
definite measurable quantity, that they have a functional relation
to each other in that for every value of one there is a corresponding
value of the other, but the causal relation of the two is by no means
proven. Or if this relation be granted, these results alone do not
show which of the two is the cause and which the effect. It may
be that a child in whom has been built up a very great control over
the fundamental number associations may for that reason be able
to attempt a large number of examples. But it may also be that
it is the ability to work a large number of examples which enables
him to make a high score in the tables. It might even be that the
two have no direct relation, both being but expression in different
ways of the same basic facts: a retentive memory, a short reaction
time, and a perfect muscular control. Whatever the explanation,
of the facts themselves there is no question. One of the most
striking results of the testing work has been the marvelous
agreement, both in average score and in range of distribution, of results
from schools in widely separated localities. Whether returns are
received from a small public school in Virginia, a country school in
Kansas, a private school in a northern state, or a large public school
in New York City, a seventh-grade class will make about the same
average scores in the various tests. Slightly higher scores in some
tests, it is true, do occur but at a cost of lower scores in other tests.
The product of teaching in arithmetic seems to be determined
mainly by factors that must be common to all schools rather than
by the artificial differences created by teachers, methods of work,
and courses of study.

[Fig. 2: graph, not reproduced. Vertical axis, Test 7 (attempts and
rights); horizontal axis, totals, Tests 1-4.]

Fig. 2. Relation between ability to work abstract examples in Test 7 and
Knowledge of the Tables on basis of scores of 55,000 children in Detroit, Boston, New
York, and in 60 to 70 schools in ten states. Ability to attempt examples is almost
exactly proportional to degree of knowledge of tables. The relation is less exact in
case of examples right. Light lines represent the scores previously selected as standard.

In this connection it is interesting to compare the scores selected 
as standards from the measurement of a few thousand children in 
many schools — mainly schools in smaller cities and towns — with 
the relations shown in the graph which were derived from six times 
as many children from a few school systems. The standard scores 
are represented by the light lines in the figure. It is evident that 
the standard scores are not far wrong even in "Rights," where the 
greatest differences occur. 

TABLE II

Distribution of scores in the tables (Tests 1-4) made by a group of 958 children,
each of whom had a score of 12 examples right in Test 7

Scores in tables         90   120   150   180   210   240   270   300   330   360
Number making score       2    25   169   293   236   139    68    18     6     2



The reader should be careful, also, to remember that a relation 
may be true in general but not at all true of certain individuals, and 
that an average score hides the range of individual variation. In 
Table II, for instance, is given the distribution in one of these traits 
(total score of four speed tests) of a group of children having the 






same score in the other trait (Test 7, rights). Fig. 3 shows the same 
facts graphically. Although these 958 children all had exactly 
twelve examples right in Test 7 and should by the relation in Fig. 2 
have a score of 240 in the tables, their actual scores range from 90 
to 360. 

That is, the children at one end of the distribution have four 
times the equipment in knowledge of the tables of the children at 
the other end, yet were able to get no more examples right. 
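
The spread described here can be read directly from the counts of Table II. The sketch below is illustrative only: it assumes each tabulated score can be treated as a single point value (the article does not say whether the figures are class limits or class marks), and the names `TABLE_II` and `summarize` are introduced here.

```python
# Table II: score in the tables -> number of children
# (every child had 12 examples right in Test 7).
TABLE_II = {90: 2, 120: 25, 150: 169, 180: 293, 210: 236,
            240: 139, 270: 68, 300: 18, 330: 6, 360: 2}

def summarize(distribution):
    """Return (children, lowest, highest, modal score, mean) for a frequency table."""
    children = sum(distribution.values())
    lowest, highest = min(distribution), max(distribution)
    modal = max(distribution, key=distribution.get)
    mean = sum(score * n for score, n in distribution.items()) / children
    return children, lowest, highest, modal, mean

children, lowest, highest, modal, mean = summarize(TABLE_II)
ratio = highest / lowest  # spread between the two ends of the distribution
```

The counts sum to 958 children, the highest score is four times the lowest, and the mean on the point-value assumption comes to about 199, close to the average of 201 reported with Fig. 3.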



[Fig. 3: graph, not reproduced. Vertical scale marked at 100, 200,
and 300; horizontal axis, totals, Tests 1-4, marked at 150, 210, 270,
330, and 390.]

Fig. 3. Distribution of scores in the tables (Tests 1-4) made by a group of 958
children, each of whom had a score of 12 examples right in Test 7. Average score
= 201.



The mode is at 180 to 210, and in Table III is given the
distribution in Test 7 of about 5,564 children, all of whom have a score
of 180 to 210 in the tables. The range here is greater than before,
some children failing in every example attempted, others having a
perfect score of 19 examples right. Moreover the average falls, not
at 12 as it should by the previous figure, but at 8.1, more nearly the
point in accord with the computed value.
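
The width of this distribution can be expressed by the "middle half" used earlier in the article. The sketch below reads the quartiles off the Table III counts. Two assumptions are labeled in the code: the first count is taken to belong to a score of 0 (the text mentions children failing every example), and the one count that is illegible in the scan is taken as 598, the value that makes the counts total 5,564. The function name `score_at_quantile` is introduced here.

```python
# Table III: index = examples right in Test 7 (assumed to run 0-19),
# value = number of children scoring 180 to 210 in the tables.
# The ninth count is illegible in the scan; 598 is assumed because it
# brings the total to the 5,564 children stated in the text.
TABLE_III = [189, 137, 222, 252, 366, 422, 495, 534, 598, 559,
             499, 418, 293, 264, 141, 90, 49, 29, 6, 1]

def score_at_quantile(counts, q):
    """Smallest score whose cumulative count reaches the fraction q of all cases."""
    target = q * sum(counts)
    running = 0
    for score, n in enumerate(counts):
        running += n
        if running >= target:
            return score
    return len(counts) - 1

quartiles = [score_at_quantile(TABLE_III, q) for q in (0.25, 0.5, 0.75)]
print(quartiles)  # [5, 8, 10]
```

On these assumptions the middle half of the children fall between 5 and 10 examples right, with the median at 8, although every child had nearly the same score in the tables.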

TABLE III

Distribution of scores in Test 7, rights made by a group of 5,564 children, each of
whom had a score of 180 to 210 in the tables

Scores in Test 7, rights      0     1     2     3     4     5     6     7     8     9
Number making score         189   137   222   252   366   422   495   534   598   559

Scores in Test 7, rights     10    11    12    13    14    15    16    17    18    19
Number making score         499   418   293   264   141    90    49    29     6     1

[Fig. 4: graph, not reproduced. Horizontal axis, Test 7, rights,
marked at 5, 10, 15, and 20.]

Fig. 4. Distribution of scores in Test 7, rights made by a group of 5,564 children,
each of whom had a score of 180 to 210 combinations. Average score, 8.1 examples.

A similar range of variation has been found in all schools so far
examined, and the necessity for, and the value of, a study of the
reliability of the first scores must be apparent at once. The
interpretation to be placed upon such results depends almost wholly
upon one's opinion as to the reliability of "first" scores. The
writer feels, however, that on the basis of the evidence presented
above it is possible to answer in the affirmative the first of the two
questions proposed. Readiness of association in the case of the
fundamental combinations is a sufficiently definite and stable trait
to admit of measurement. The second question may then be
discussed.






If readiness of association is in itself a stable quantity, and
further, if it is the determining factor in the achievement by which
it is measured, then repeated tests of an individual ought to show
only slight fluctuations above and below a certain constant value.
If, on the other hand, it is not a determining factor, all sorts of
variations are possible. Further, since measurements of the abilities
of many school children show in general a constant relation between
scores in Test 7 and knowledge of the tables, this and the foregoing
can be used as criteria in judging particular cases. The results of
the study being discussed show clearly that great changes in scores
are produced by repeated measurement and the writer is disposed
to agree with the authors in assigning the practice effect to changes
in factors other than the one to be measured. The ability generated
by practice is the specific ability to write answers to the particular
test and not increased readiness of association. For the latter
would presumably function in Test 7, while the former does not.

[Fig. 5: Comparative Graph Sheet, printed form with curves A and B,
not reproduced.]

Fig. 5. Comparative Graph Sheet, a device for comparing individual scores, or
class averages, with standard scores. A represents an ideal curve of a standard
individual, B the actual curve of the most nearly standard individual found among 1,500
twelve-year-old children.

To show this, a device adopted by the writer for making graphic 
use of the scores of either individuals or classes for diagnostic pur- 
poses must be explained. It is called a comparative graph sheet and 
is illustrated in Fig. 5. Along the horizontal lines representing the 
various tests scales are so drawn that the standard score for each 
grade falls directly below the grade. As a result the curve of a 
standard grade or individual is a straight line as shown at A.

Such balance of development is seldom found, and B represents
the most nearly perfect curve found in the examination of the
scores of about 1,500 twelve-year-old boys and girls.

In Fig. 6 is shown the specific character of the ability generated 
by repeated uses of the tests as drill exercises. Curves A and B are 
from two eighth-grade sections in a large city school. Section A 
worked directly upon Tests 1-4 throughout the year and large 
practice effect and high average scores in the tests of the elemental 
combinations resulted. It is to be noted, however, that in Test 7, 
in which these elemental abilities are put to use, the curves of the 
two sections agree. 

That is, the higher scores in the elemental tests in the four 
operations that were made by Section A do not mean greater 
ability to work more abstract examples in a given time. Precisely 
similar results can be shown for other classes and for individuals 
where the tests have been misused in this same way. 

It should be apparent, therefore, that the large practice effects 
obtained in the study under discussion probably have the signifi- 
cance that the authors attach to them; that is, the larger gains 
were caused by greater speed in "getting started," more rapid 
writing due, as the authors suggest, to degeneration of the form of 
the figures written, and to other changes of a similar nature, not 
to greater knowledge or control of the proper associations. In a 
"first" test there is a certain cautious watchfulness, a guarding 
against surprises. The child works as rapidly as he can but his 
attention is on the arithmetical phase of the work. At any subse- 
quent test, however, he knows what to expect. He can take his 






attention from the arithmetical work and bend all his energies to 
securing speed. In the writer's experience a second test shows 
quite uniformly a rise in scores of from 10 to 15 per cent. Repeating
the test twenty-five times in five days would afford ample
opportunity for such improvement, particularly if the phrasing of the
instructions gave the children a strong incentive.



"JtfMtor* trW errTciancy of f IU cnlir* «cAW, net thi inJii/tJaml aiilitr •fthm/mf" 



City .. 



COURTIS STANDARD TESTS 

Arithmetic 



— B 



Coaparativa 
Graph Shaat 



Grade* 

Teid 

1 . Addition 

2. Subtraction 

3. Multiplication 

4. Division 



Copying; Figures 
Speed Reasoning 



Right* 

Fundamentals 
Atteroptt 



Righti 

Reasoning 

AttempU 



T 2 — i_ 3—J— 4— |— 5— |— 6— J 7— p-*Z±~ 9— I 
M I 30 '40 I « , ^ *o I 1 * Te^^ 

1 I 1 1. 1 I i t 1 I ti n 1 1 1 1 1 1 /Ti l 



Rights q 




INSTRUCTIONS 

Find on the proper acale for each teat the point corresponding to the grade average or individual score (or that test, and 
join point to point. The curve of any grade in a "Good" school or of a "Good" scholar ia the trade will lie wholly within 
the boundriea for that grade. Excellence ia ihown by balance aa well at by high scores. 

Tin Kilo art b**td upuo "Staadard Searet" derived lroai ibe mmnanl »( aia* Ihoaeead children ia eiitv different xbaol* is tea differeei 



Fig. 6. — Comparison of grade averages of two sections of an eighth-grade class 
in a large city school. Section A was drilled on Tests 1-4 through the year. Section B
was not practiced. The fact that the scores agree in Test 7 shows that the ability 
generated by the repeated use of the tests was specific, and did not transfer. 

To the writer, therefore, the study seems to yield a measure of
the amount of change in achievement that can be produced by
practice, rather than any data bearing upon the reliability of the
first scores as measures of the readiness of association. Since such
evidence as that presented above proves that the increase in skill
does not transfer to practical work, he is still of the opinion that
the first scores are more reliable as a base for detecting individual
defect than the twenty-fifth score or the median of the twenty-five
scores.

It is particularly to be noted that in a general way the variation 
within the class was little changed by the practice. By the first 
scores the extreme range of the class was 60 combinations (88—28), 
the range of the middle half of the group being 18 combinations. 
The extreme range of the medians of the twenty-five scores was 65 
combinations (107—42), roughly a 10 per cent increase, while the 
range of the middle half was still 18 combinations (80—62). It is 
to be noted also that the extreme range of the class was 60 com- 
binations, and was more than twice the average practice effects of 
the twenty-five trials. To the writer, these facts all have the same 
explanation. 

In the first scores and in all subsequent scores the determining 
factor in the differences between the achievements of different 
individuals is the readiness of associations. Under practice, how- 
ever, the absolute scores change for causes and by amounts which 
vary from individual to individual. The greater the practice, the 
greater the differences produced by the practice effects. The first
scores, therefore, in which the effects of other factors are at a
minimum, measure the actual differences in the readiness of
association more reliably than any other scores.
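
The range figures quoted above can be checked with a few lines of arithmetic. What follows is only a sketch: the variable and function names are illustrative assumptions, and the only data used are the extreme scores and the middle-half bounds actually given in the text.

```python
# Extreme scores quoted in the text, as (highest, lowest), in combinations.
first_scores_extremes = (88, 28)
median_scores_extremes = (107, 42)
# Bounds of the middle half of the medians of the twenty-five scores.
median_middle_half = (80, 62)


def spread(high_low):
    """Return the range (highest minus lowest) of a (high, low) pair."""
    high, low = high_low
    return high - low


first_range = spread(first_scores_extremes)
median_range = spread(median_scores_extremes)
middle_half_range = spread(median_middle_half)

# Relative growth of the extreme range under practice, in per cent.
increase_pct = 100 * (median_range - first_range) / first_range

print(first_range, median_range, middle_half_range)
print(f"increase in extreme range: {increase_pct:.1f} per cent")
```

The extreme range grows from 60 to 65 combinations, an increase of about 8 per cent, which the text rounds to "roughly a 10 per cent increase," while the range of the middle half of the medians is 18 combinations.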

In this connection it may be well to state for the benefit of all 
the writer's idea of the proper use of the tests. They are not a 
method of instruction. The first six of the eight tests cover abilities 
which should be developed through oral work and never through 
written practice as such. Neither are the tests examinations for 
promotion. The abilities covered are too simple and the conditions 
of the testing too artificial to be of value from this point of view. 
The tests are, however, comparative rulers for arithmetic, and if 
given not more than four times a year, reflect accurately the great 
complex changes produced by school work. The results are of 
value, therefore, to all who are making a critical study of school 
conditions, whether from the point of view of administration (the
determination of efficient methods, the comparison of efficiency of
teaching from school to school, or system to system, characteristics
of the course of study, etc.) or from the point of view of the
teacher of the individual child. The fundamental idea of the
system, however, is that of comparison on the basis of a sample
obtained under set conditions. First scores, as has been pointed out, are
proved by this very study to represent more nearly the existing 
conditions than any other scores, and if the tests are given only 
at long intervals, all the complications due to the introduction of 
other factors are avoided. 

Before leaving the discussion of the nature of the practice effect, 
it will be necessary to consider the effect of the rejection of the 
papers of the inaccurate workers. The authors express much sur- 
prise at their discovery that papers having high scores were in 
general free from mistakes. The writer made the same discovery 
some years ago and it is written in the folder of instructions to 
scorers (Folder C), in the manual, and in other discussions of the 
same sort. The statements, however, probably need revision, as 
in saying that "Courtis regards them [the errors] as negligible in 
these tests" (p. 96), the authors show they have failed to get the 
idea the writer intended to express. From the scoring of many 
hundreds of papers it has been determined that in general the 
errors in the addition test will average from 1 per cent to 3 per cent 
of the answers written, and in general that it would be better to 
expend in other ways the time and effort needed to detect this 
small number of errors.

The purpose of the search for errors is to discover those who do 
not know their tables, but, since "scores containing many errors did 
not average as many combinations as scores without them, which 
seems to point to some third factor as being responsible for errors 
and smaller scores alike," it is possible to detect such children by 
their small scores and there is no need to search for mistakes. The 
authors call the factor determining both errors and small scores 
the "predisposition of the moment" [the italics are mine], but else- 
where the writer has arrayed the data that prove this factor is the 
basic factor in the learning process — the specialization of the mental 
abilities of the individual by the forces of heredity. Measurement 
of whole families, of twins, of the behavior of twins under practice, 
etc., proves that each individual responds selectively to school 
training on the basis of his natural aptitudes. One child can learn
addition readily but is unable to master subtraction, the next does
well in subtraction but cannot learn addition. The authors speak 
of an "observed erratic character of their performance" in discuss- 
ing the elimination of those making small scores. The writer, 
curious to see for himself the character of such performance, gave 
the next day after reading the article five successive tests to a girl 
with a very marked and stubborn weakness in division, a girl who 
after two years of special work still averages four errors per minute. 
Her scores in order were 46 (first), 42, 30, 34, 28 (fifth). The 
decline in the scores is due almost wholly to the fatigue caused 
by the great effort that for her is necessary to do such work. In the 
study, therefore, the elimination had the effect of selecting those 
who by nature were able to respond to the kind of practice they 
were to undergo and even this selected group responded in very 
different ways to the practice series, as is shown by the increase in 
the extreme range, previously noted, and by the differences in the 
sample practice curves shown. Had the results in the entire 
unselected group been used, these differences would have been more 
marked and the hopelessness of attempting to eliminate the indi- 
vidual practice effect on the basis of a general idealized practice 
curve more apparent. For any discussion of the individual, the 
results from that individual alone must be used — his behavior is 
an individual matter. For general laws of mental behavior, results 
from large groups eliminate all the idiosyncrasies of individuals. 
The writer, therefore, does not regard the "hypothetical first 
scores" as being in any way the true measures of the initial ability 
of individuals. 

The data of the study, however, may be made to yield informa- 
tion as to the size of the chance errors in such mental measurements. 
In Fig. 7, the figure on p. 98 of the article is reproduced, but through 
each curve a line has been drawn to represent the growth in ability, 
the practice curve of the individual. The actual scores in the suc- 
cessive tests fall now above and now below the practice curves in 
the usual manner of measurements containing chance errors.

Fig. 7. — Three individual practice curves and graphs of the actual scores, showing
variation caused by chance errors.

That is, the writer interprets the curves to mean that any one score
is determined by two major factors (assumed for the present to be
"readiness of association" and "practice effect") and by a number
of minor factors. The "predisposition of the moment," in the
sense of the combination or opposition of the minor factors,
determines the variation of the score. By scaling the scores from the
curves and finding the differences between the ideal and the actual
values, the writer secured an average deviation of 2.1 from the
seventy-two differences. Even if the difference between each score
and the next is used as a measure of the effect of the chance errors,
the average variation would be increased to but 3.7. The
distribution of these differences is as follows:



DISTRIBUTION OF DIFFERENCES SHOWING THE NUMBER OF CASES OF EACH SIZE

Size of deviations                0   1   2   3   4   5   6   7   8   9  10  11 ... 18   Average difference
Variation from curve             11  18  19  10   7   6   0   1   0   0   0   0      0   2.1
Variation from preceding score    6  14  12   5  14   8   2   6   1   1   2   1      1   3.7



The writer is sorry not to have the full data from all the curves, 
including those of the children who made many errors. It is to be 
hoped the authors will supply the deficiency. 
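
The two averages quoted for the distribution above (2.1 and 3.7) can be checked as weighted means of the deviation sizes. The sketch below is an illustration under stated assumptions, not the writer's own computation: the counts are as read from the printed table, the placement of the counts for deviations of 8 and larger in the second row is an assumption, and the names are hypothetical.

```python
# Number of cases for each size of deviation, as read from the table.
# Assumption: in the second row the counts 1, 1, 2, 1, 1 belong to
# sizes 8, 9, 10, 11 and 18 respectively.
variation_from_curve = {0: 11, 1: 18, 2: 19, 3: 10, 4: 7, 5: 6, 7: 1}
variation_from_preceding = {
    0: 6, 1: 14, 2: 12, 3: 5, 4: 14, 5: 8, 6: 2, 7: 6,
    8: 1, 9: 1, 10: 2, 11: 1, 18: 1,
}


def average_deviation(counts):
    """Weighted mean of deviation sizes, weighted by number of cases."""
    cases = sum(counts.values())
    total = sum(size * n for size, n in counts.items())
    return total / cases


curve_cases = sum(variation_from_curve.values())
curve_average = average_deviation(variation_from_curve)
preceding_average = average_deviation(variation_from_preceding)

print(curve_cases, round(curve_average, 1), round(preceding_average, 1))
```

Read this way, the first row contains the seventy-two differences mentioned in the text and averages about 2.1, and the second row averages about 3.7, in agreement with the figures quoted.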

The significance of deviations of this order will be discussed 
below. 

Turning now from the discussion of the results of the study, the 
writer will present data of his own bearing on the nature of the 
relation between scores in Test 1 and scores in column addition 
(conclusion three, above), other data showing the effect of chance 
errors, and will comment on the diagnostic interpretation of indi- 
vidual curves on the comparative graph sheet. Until such a com- 
plete discussion of the facts in the case has been presented, he will 
be unable to take up the first two conclusions drawn from the study. 



